{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np",
    "\n",
    "import theano",
    "\n",
    "import theano.tensor as T",
    "\n",
    "import lasagne",
    "\n",
    "import os"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Generate names\n",
    "* Struggle to find a name for the variable? Let's see how you'll come up with a name for your son/daughter. Surely no human has expertize over what is a good child name, so let us train NN instead.\n",
    "* Dataset contains ~8k human names from different cultures[in latin transcript]\n",
    "* Objective (toy problem): learn a generative model over names."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "start_token = \" \"",
    "\n",
    "\n",
    "with open(\"names\") as f:",
    "\n",
    "    names = f.read()[:-1].split('\\n')",
    "\n",
    "    names = [start_token+name for name in names]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('n samples = ', len(names))",
    "\n",
    "for x in names[::1000]:",
    "\n",
    "    print(x)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Text processing\n",
    "First we need next to collect a \"vocabulary\" of all unique tokens i.e. unique characters. We can then encode inputs as a sequence of character ids."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# all unique characters go here",
    "\n",
    "token_set = <YOUR CODE: a list of all unique characters in names, including space >",
    "\n",
    "\n",
    "tokens = list(token_set)",
    "\n",
    "print('n_tokens = ', len(tokens))",
    "\n",
    "assert 54 < len(tokens) < 56"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Theano is built for numbers, not strings of characters.\n",
    "We'll feed our recurrent neural network with ids of characters from our dictionary.\n",
    "\n",
    "To create such dictionary, let's assign each character with it's index in tokens list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "token_to_id = <dictionary of symbol -> its identifier(index in tokens list) >",
    "\n",
    "\n",
    "id_to_token = < dictionary of symbol identifier -> symbol itself >"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt",
    "\n",
    "%matplotlib inline",
    "\n",
    "plt.hist(list(map(len, names)), bins=25)",
    "\n",
    "\n",
    "# truncate names longer than MAX_LEN characters.",
    "\n",
    "MAX_LEN = min([60, max(list(map(len, names)))])",
    "\n",
    "# ADJUST IF YOU ARE UP TO SOMETHING SERIOUS"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Cast everything from symbols into identifiers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "names_ix = list(map(lambda name: list(map(token_to_id.get, name)), names))",
    "\n",
    "\n",
    "\n",
    "# crop long names and pad short ones",
    "\n",
    "for i in range(len(names_ix)):",
    "\n",
    "    names_ix[i] = names_ix[i][:MAX_LEN]  # crop too long",
    "\n",
    "\n",
    "    if len(names_ix[i]) < MAX_LEN:",
    "\n",
    "        names_ix[i] += [token_to_id[\" \"]] * \\",
    "\n",
    "            (MAX_LEN - len(names_ix[i]))  # pad too short",
    "\n",
    "\n",
    "assert len(set(map(len, names_ix))) == 1",
    "\n",
    "\n",
    "names_ix = np.array(names_ix)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Input variables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from agentnet import Recurrence",
    "\n",
    "from lasagne.layers import *",
    "\n",
    "from agentnet.memory import *",
    "\n",
    "from agentnet.resolver import ProbabilisticResolver"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sequence = T.matrix('token sequence', 'int64')",
    "\n",
    "\n",
    "inputs = sequence[:, :-1]",
    "\n",
    "targets = sequence[:, 1:]",
    "\n",
    "\n",
    "\n",
    "l_input_sequence = InputLayer(shape=(None, None), input_var=inputs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Build NN\n",
    "\n",
    "You'll be building a model that takes token sequence and predicts next tokens at each tick\n",
    "\n",
    "This is basically equivalent to how rnn step was described in the lecture"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# One step of rnn",
    "\n",
    "class step:",
    "\n",
    "\n",
    "    # inputs",
    "\n",
    "    inp = InputLayer((None,), name='current character')",
    "\n",
    "    h_prev = InputLayer((None, 10), name='previous rnn state')",
    "\n",
    "\n",
    "    # recurrent part",
    "\n",
    "    emb = EmbeddingLayer(inp, len(tokens), 30, name='emb')",
    "\n",
    "\n",
    "    h_new =  # YOUR CODE: concat emb and h_prev and feed them to DenseLayer. Everything must be lasagne layers",
    "\n",
    "\n",
    "    # YOUR CODE: compute probabilities for next tokens, should also be lasagne layer that uses h_new",
    "\n",
    "    next_token_probas =",
    "\n",
    "\n",
    "    # pick next token from predicted probas",
    "\n",
    "    next_token = ProbabilisticResolver(next_token_probas)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "training_loop = Recurrence(",
    "\n",
    "    state_variables={step.h_new: step.h_prev},",
    "\n",
    "    input_sequences={step.inp: l_input_sequence},",
    "\n",
    "    tracked_outputs=[step.next_token_probas, ],",
    "\n",
    "    unroll_scan=False,",
    "\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Model weights",
    "\n",
    "weights = lasagne.layers.get_all_params(training_loop, trainable=True)",
    "\n",
    "print(weights)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "predicted_probabilities = lasagne.layers.get_output(",
    "\n",
    "    training_loop[step.next_token_probas])",
    "\n",
    "# If you use dropout do not forget to create deterministic version for evaluation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "loss =  # <Loss function - a simple categorical crossentropy will do, maybe add some regularizer>",
    "\n",
    "\n",
    "updates = lasagne.updates.adam(loss, weights)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Compiling it"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "# training",
    "\n",
    "train_step = theano.function([sequence], loss,",
    "\n",
    "                             updates=training_loop.get_automatic_updates()+updates)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# generation\n",
    "\n",
    "here we re-wire the recurrent network so that it's output is fed back to it's input. \n",
    "\n",
    "We also make sure to feed id of `\" \"` as initial token."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "n_steps = T.scalar(dtype='int32')",
    "\n",
    "x0 = InputLayer([None], theano.shared(np.int32([token_to_id[' ']])))",
    "\n",
    "\n",
    "feedback_loop = Recurrence(",
    "\n",
    "    state_variables={step.h_new: step.h_prev,",
    "\n",
    "                     step.next_token: step.inp},",
    "\n",
    "    tracked_outputs=[step.next_token_probas, ],",
    "\n",
    "    state_init={step.next_token: x0},",
    "\n",
    "    batch_size=theano.shared(1),",
    "\n",
    "    n_steps=n_steps,",
    "\n",
    "    unroll_scan=False,",
    "\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "generated_tokens = get_output(feedback_loop[step.next_token])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "generate_sample = theano.function(",
    "\n",
    "    [n_steps], generated_tokens, updates=feedback_loop.get_automatic_updates())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def generate_string(length=MAX_LEN):",
    "\n",
    "    output_indices = generate_sample(length)[0]",
    "\n",
    "\n",
    "    return ''.join(tokens[i] for i in output_indices)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "generate_string()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Model training\n",
    "\n",
    "Here you can tweak parameters or insert your generation function\n",
    "\n",
    "\n",
    "__Once something word-like starts generating, try increasing seq_length__\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def sample_batch(data, batch_size):",
    "\n",
    "\n",
    "    rows = data[np.random.randint(0, len(data), size=batch_size)]",
    "\n",
    "\n",
    "    return rows"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "print(\"Training ...\")",
    "\n",
    "\n",
    "\n",
    "# total N iterations",
    "\n",
    "n_epochs = 100",
    "\n",
    "\n",
    "# how many minibatches are there in the epoch",
    "\n",
    "batches_per_epoch = 500",
    "\n",
    "\n",
    "# how many training sequences are processed in a single function call",
    "\n",
    "batch_size = 10",
    "\n",
    "\n",
    "\n",
    "for epoch in xrange(n_epochs):",
    "\n",
    "\n",
    "    avg_cost = 0",
    "\n",
    "    for _ in range(batches_per_epoch):",
    "\n",
    "\n",
    "        avg_cost += train_step(sample_batch(names_ix, batch_size))",
    "\n",
    "\n",
    "    print(\"\\n\\nEpoch {} average loss = {}\".format(",
    "\n",
    "        epoch, avg_cost / batches_per_epoch))",
    "\n",
    "\n",
    "    print(\"Generated names\")",
    "\n",
    "    for i in range(10):",
    "\n",
    "        print(generate_string(),)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# And now,\n",
    "* try lstm/gru\n",
    "* try several layers\n",
    "* try mtg cards\n",
    "* try your own dataset of any kind"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
